High Precision Information Extraction

Authors

  • Rich Caruana
  • Paul G. Hodor
  • John Rosenberg
Abstract

Most fully automatic information extraction systems achieve less than 100% extraction precision and recall. In real applications these measures typically vary between 50% and 95%, depending on the extraction method and source data. We present an information extraction system designed for applications where little or no error can be tolerated. The system is not fully automatic. Instead, the extraction is guided by the intervention of a human expert. This "expert in the loop" approach greatly amplifies the amount of extraction an individual can accomplish, while ensuring that the extraction process is nearly 100% accurate. We used a tool we created called HPIEW (for High Precision Information Extraction Workbench) to extract several different fields from the text remark fields of the Protein Data Bank (PDB). The workbench allowed us to extract each field from more than 5,000 PDB files in an afternoon, with extraction precision and recall estimated to be greater than 99.9%. We believe this approach may be useful for other extraction problems where extreme accuracy is required.
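The abstract does not describe how HPIEW works internally, so the following is only a minimal sketch of one way an expert-in-the-loop extraction could be organized, not the authors' method. It assumes that candidates are proposed by a regular expression, that identical candidate values are grouped so the expert judges each distinct value once, and that the expert's decision is modeled by a callback. The function names, the record identifiers, and the sample remark strings are all hypothetical and are not taken from real PDB files.

```python
import re
from collections import defaultdict


def propose_candidates(remarks, pattern):
    """Return (record_id, value) pairs for every remark the pattern matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    candidates = []
    for record_id, text in remarks.items():
        match = regex.search(text)
        if match:
            candidates.append((record_id, match.group(1).strip()))
    return candidates


def group_by_value(candidates):
    """Group record ids by candidate value so each value is reviewed once."""
    groups = defaultdict(list)
    for record_id, value in candidates:
        groups[value.lower()].append(record_id)
    return dict(groups)


def review(groups, approve):
    """Keep only the groups the expert approves; return record_id -> value."""
    accepted = {}
    for value, record_ids in groups.items():
        if approve(value, record_ids):
            for record_id in record_ids:
                accepted[record_id] = value
    return accepted


# Hypothetical remark texts, invented for illustration only.
remarks = {
    "rec1": "RESOLUTION. 2.0 ANGSTROMS.",
    "rec2": "RESOLUTION. 2.0 ANGSTROMS.",
    "rec3": "RESOLUTION. NOT APPLICABLE.",
}

candidates = propose_candidates(remarks, r"RESOLUTION\.\s+(\S+)")
groups = group_by_value(candidates)


def expert_approves(value, record_ids):
    """Stand-in for the human expert: accept only plain decimal numbers."""
    return value.replace(".", "", 1).isdigit()


accepted = review(groups, expert_approves)
print(len(groups), "decisions covered", len(candidates), "candidates")
print(accepted)
```

In this sketch the expert makes one decision per distinct value rather than one per record, which is one plausible reading of how a single person's effort could be amplified; in an interactive tool the callback would be replaced by a prompt or a review screen.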

Similar resources

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) that has been intensified by the growing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...

Information Extraction as a Basis for High-precision Text Classification

We describe an approach to text classification that represents a compromise between traditional word-based techniques and in-depth natural language processing. Our approach uses a natural language processing task called information extraction as a basis for high-precision text classification. We present three algorithms that use varying amounts of extracted information to classify texts. The rele...

Design and evaluation of an ontology based information extraction system for radiological reports

This paper describes an information extraction system that extracts the information available in free-text Turkish radiology reports and converts it into a structured information model using manually created extraction rules and a domain ontology. The ontology provides flexibility in the design of extraction rules, and determines the information model for the extracted semantic information. Although...

Address extraction using hidden Markov models

This paper presents the implementation and evaluation of a Hidden Markov Model to extract addresses from OCR text. Although Hidden Markov Models discover addresses with high precision and recall, this type of Information Extraction task seems to be affected negatively by the presence of OCR text.

Active Learning Selection Strategies for Information Extraction

The need for labeled documents is a key bottleneck in adaptive information extraction. One way to solve this problem is through active learning algorithms that require users to label only the most informative documents. We investigate several document selection strategies that are particularly relevant to information extraction. We show that some strategies are biased toward recall, while other...

Publication date: 2000